ITEM CHARACTERISTIC CURVE PARAMETERS:
EFFECTS OF SAMPLE SIZE ON LINEAR EQUATING


I.  INTRODUCTION

The  application  of  the  technology  of  computer  driven  adaptive  testing  requires  the 
development of large banks of test items. Each bank may contain 250 to 400 items, and all must
measure  the  same  ability  on  the  same  metric  or  scale.  It  is  unreasonable  and  impracticable  to 
assemble  a  single  group  of  2,000  subjects  for  250  to  400  minutes  to  try  all  the  items;  therefore, 
a  method  for  linking  together  subsets  of  items  administered  to  varying  groups  must  be 
investigated.  Item  Characteristic  Curve  (ICC)  theory  offers  a  unique  method  of  linking  subsets  of 
test  items  due  to  the  invariance  property  of  the  ICC  parameters.  This  invariance  property  rests  on 
the  two  major  theoretical  assumptions  of  latent-trait  theory:  (a)  unidimensionality  and  (b)  local 
independence. Unidimensionality means that only a single ability is being measured and is assumed
to  be  the  property  of  an  item  pool,  even  when  assembled  into  subsets.  Local  independence  means 
that  the  subjects’  responses  to  an  item  are  independent  of  the  responses  to  another  item.  More 
simply  put,  this  means  that  the  item  response  is  a  function  of  ability  and  no  other  factor.  In 
effect,  this  is  a  restatement  of  the  unidimensionality  assumption.  If  an  item  pool  is 
unidimensional,  then  any  shift  in  score  metric  that  is  due  to  a  linear  transformation  may  be 
corrected  or  adjusted  by  application  of  the  proper  complementary  linear  transformation.  This  is 
what  is  meant  by  the  idea  that  latent-trait  parameters  are  invariant  to  a  linear  transformation,  and 
it  is  this  theoretical  property  that  allows  item  pools  to  be  linked  and  transformed  to  a  common 
metric.  In  previous  research  efforts,  item  pools  have  been  linked  via  the  method  of  linear  equating 
(see  Lord,  1977;  Ree,  1977;  Sympson  &  Ree,  in  press)  with  apparent  success.  To  date,  there  has 
been  little  research  on  the  efficacy  of  these  linking  procedures  and  the  effects  of  errors  in  ICC 
parameter  estimation  on  their  (linearly)  transformed  values. 

ICC Parameters

The  three  parameter  logistic  model  of  Birnbaum  (Lord  &  Novick,  1968)  is  the  most 
frequently  used  for  relating  item  responses  to  subjects'  ability.  The  three  parameters,  a,  b,  and  c, 
are  item  discrimination,  item  difficulty  (or  location),  and  probability  of  chance  success  (or  lower 
asymptote),  respectively. 

The  curve  described  by  these  parameters  takes  the  shape  of  an  ogive  (cumulative  frequency) 
or  an  “s”  with  the  upper  asymptote  approaching  a  probability  of  1.0  and  usually  a  lower 
asymptote  of  a  probability  greater  than  0.0.  The  ogive  describes  the  probability  of  obtaining  a 
correct  answer  to  an  item  as  a  monotonic  increasing  function  of  ability. 

The item discrimination parameter, a, is a function of the slope of the ICC and generally
ranges from .5 to about 2.5. The value of a equal to about 1.0 is typical of many test items,
while a values below .5 are insufficiently discriminating for most testing purposes, and a values
above 2.0 are infrequently found.

The  item  difficulty  parameter,  b,  describes  the  point  of  inflection  of  the  ICC  and  is  usually 
scaled  between  -2.5  and  +2.5,  although  the  metric  is  arbitrary. 

The  item  guessing  parameter,  c,  is  the  lower  asymptote  of  the  ICC  and  is  generally 
conceived  as  the  probability  of  selecting  the  correct  item-option  by  chance  alone.  Most  test  items 
have  c  parameters  greater  than  0.0  and  less  than  or  equal  to  .30. 

Figure 1 shows three ICCs. The horizontal axis is scaled in units of ability θ and the
vertical axis is the probability of answering the item correctly. The solid curved line shows an ICC
for an item of average difficulty with acceptable discrimination and the lower asymptote
appropriate for a five-option multiple-choice item. The dashed line shows an item of identical
difficulty, c value of .28, but with a lower a value. Note how the slope of the curve is less
steep. The third curve, the dot-dash line, shows an item with a c value of .30, an a parameter of 1.0,
and the b parameter equal to 1.0. As the b parameter changes, the location of the inflection
point of the curve is displaced along the horizontal axis.

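Equation 1 presents the mathematical function describing the curve.

    P(\theta)_j = c_j + (1 - c_j) \left[ 1 + e^{-D a_j (\theta - b_j)} \right]^{-1}        (1)

where a_j, b_j, and c_j are the parameters of item j and D is a scaling constant
(conventionally 1.7, which brings the logistic ogive close to the normal ogive).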
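The curve in Equation 1 can be evaluated directly. The following is a minimal sketch, not the program used in the study; the function name `icc_probability` and the default scaling constant `D=1.7` are assumptions made for illustration.

```python
import math

def icc_probability(theta, a, b, c, D=1.7):
    """Three-parameter logistic ICC: probability of a correct response.

    D is a scaling constant; the default of 1.7 is an assumed convention.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# At theta == b the logistic term equals 0.5, so P = c + (1 - c) / 2.
p_at_b = icc_probability(theta=1.0, a=1.0, b=1.0, c=0.30)

# Far below b the curve approaches the lower asymptote c.
p_low = icc_probability(theta=-4.0, a=1.0, b=1.0, c=0.30)

# Far above b the curve approaches the upper asymptote 1.0.
p_high = icc_probability(theta=6.0, a=1.0, b=1.0, c=0.30)
```

The item used here (a = 1.0, b = 1.0, c = .30) matches the third curve described for Figure 1; at the inflection point the probability lies halfway between c and 1.0.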


Previous  research  (Ree,  1978)  indicates  that  the  ICC  parameters  may  be  estimated  with  some 
reasonable  degree  of  accuracy,  providing  a  sufficient  sample  of  examinees  with  an  appropriate 
distribution of ability, θ, is available.

Linking  Paradigms 

Two  fundamental  linking  procedures  may  be  defined  and  are  known  as  the  Anchor  Items 
Method  (AIM)  and  the  Anchor  Subjects  Method  (ASM).  In  AIM,  every  subset  of  items  is 
administered  to  a  different  sample  of  subjects,  but  embedded  into  the  group  of  items  to  be 
analyzed  is  a  common  (or  anchor)  set  of  items.  During  analysis,  the  anchor  items  are  identified, 
and  the  following  linear  transformation  is  applied  to  the  resultant  ICC  parameters: 


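    b_d = \frac{s_{b_d}}{s_{b_o}} (b_o - \bar{b}_o) + \bar{b}_d        (2)

where b_d is the item location parameter transformed to the desired scale, b_o is the observed
location parameter, \bar{b}_d and \bar{b}_o are the anchor-item means of b on the desired and
observed scales, and s_{b_d} and s_{b_o} are the standard deviations of the desired scale and
observed scale, respectively. A similar procedure for the a parameter is defined by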


    a_d = \frac{s_{b_o}}{s_{b_d}} a_o        (3)

where a_d is the item discrimination parameter transformed to the desired scale, a_o is the observed a
parameter, and s_{b_d} and s_{b_o} are as in Equation (2). Because the c parameter is measured on the
probability axis, it does not change and no transformation need be applied.
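The AIM transformation can be sketched in a few lines. This is an illustration under assumptions, not the study's program: it assumes the usual mean-and-standard-deviation form of linear equating computed from the anchor items' b values, and the function name `link_to_desired_scale` and all data values are hypothetical.

```python
from statistics import mean, stdev

def link_to_desired_scale(b_obs, a_obs, anchor_b_obs, anchor_b_desired):
    """Place observed b and a values on the desired scale.

    Assumed form: slope is the ratio of anchor-item standard deviations
    (desired over observed); b is rescaled and shifted, a is divided by
    the slope, and c is left untouched.
    """
    slope = stdev(anchor_b_desired) / stdev(anchor_b_obs)
    intercept = mean(anchor_b_desired) - slope * mean(anchor_b_obs)
    b_linked = [slope * b + intercept for b in b_obs]
    a_linked = [a / slope for a in a_obs]
    return b_linked, a_linked

# Hypothetical anchor values: the observed scale is the desired scale
# stretched by a factor of 2 and shifted by 0.5.
anchor_desired = [-1.5, -0.5, 0.0, 0.5, 1.5]
anchor_observed = [2.0 * b + 0.5 for b in anchor_desired]

b_linked, a_linked = link_to_desired_scale(
    b_obs=anchor_observed,
    a_obs=[0.5, 0.6],
    anchor_b_obs=anchor_observed,
    anchor_b_desired=anchor_desired,
)
```

Because the observed values in this example are an exact linear function of the desired values, the transformation recovers the desired b values, and the a values are doubled to compensate for the halved scale.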

The  ASM  requires  that  the  same  group  of  subjects  be  available  to  take  each  subset  of  items. 
It  is  extremely  unlikely  that  the  same  2,000  subjects  could  be  assembled  to  take  items  over  a 
long  period  of  time  as  would  be  required  to  place  tests  on  the  same  metric  from  year  to  year. 
For  this  reason,  the  ASM  method  seems  less  likely  to  find  long-term  practical  application.  Because 
of  its  potential  for  use,  the  AIM  procedure  is  the  subject  of  the  present  study. 


II.  METHOD 

In  order  to  have  a  known  standard  for  reference,  a  simulation  study  was  run  using  two 
groups  of  subjects,  a  single  set  of  20  anchor  items  and  two  differing  groups  of  60  experimental, 
or  non-anchor,  items.  These  two  groups  of  items  were  assembled  into  two  tests  designated  T1  and 
T2. Both groups of simulated subjects were specified to have about the same normal distribution
of θ. Table 1 shows the mean, standard deviation, minimum and maximum of θ for the groups S1
and S2. These two groups represent what might be expected if subjects for experimental testing
were picked from some larger pool, such as candidates for military enlistment.
Response vectors for these subjects were generated on the two tests.

Table 1. Mean, Standard Deviation,
Minimum and Maximum of θ
for Groups S1 and S2

                          Groups
Parameter               S1         S2

Mean                  0.0145     0.0250
Standard deviation    0.9976     1.0045
Minimum              -2.6000    -2.6000
Maximum               2.6000     2.6000

Generation  of  Item  Responses 

In order to generate a vector of item responses for each "subject," the θ values were used in
Equation (1) to compute the likelihood of "passing" each item.

Because Equation 1 yields a number P(θ)_j such that 0.0 < P(θ)_j < 1.0, a number X_j is
drawn from a uniform (rectangular) distribution ranging from 0.0 to 1.0 and compared to P(θ)_j. If
X_j is larger than P(θ)_j, then an incorrect response is specified for the item; otherwise, a correct
response is specified for the item. Thus, a "subject" with P(θ)_j = .90 gets the item correct 9 in
10 times, and a vector of item responses is developed for each "subject" in each data set. These
response vectors are then used to investigate the AIM linking procedures.
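The response-generation rule described above can be sketched as follows. This is an illustrative reconstruction rather than the original program: the function names, the seed, and the scaling constant `D=1.7` are assumptions, and the three items are anchor items 1, 11, and 20 from Table 3.

```python
import math
import random

def icc_probability(theta, a, b, c, D=1.7):
    """Three-parameter logistic ICC (D=1.7 is an assumed scaling constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def simulate_responses(theta, items, rng):
    """Return a 0/1 response vector for one simulated subject."""
    responses = []
    for a, b, c in items:
        p = icc_probability(theta, a, b, c)
        x = rng.random()  # uniform draw on [0.0, 1.0)
        # Incorrect (0) if the draw exceeds P(theta); otherwise correct (1).
        responses.append(0 if x > p else 1)
    return responses

rng = random.Random(1979)  # arbitrary seed, for repeatability only
items = [(0.8, -1.5, 0.10), (1.4, 0.15, 0.22), (0.8, 1.5, 0.25)]
vector = simulate_responses(theta=0.0, items=items, rng=rng)

# Over many replications the proportion correct on an item should
# settle near P(theta) for that item.
n_trials = 20000
easy_item = [(0.8, -1.5, 0.10)]
p_expected = icc_probability(0.0, 0.8, -1.5, 0.10)
p_observed = sum(
    simulate_responses(0.0, easy_item, rng)[0] for _ in range(n_trials)
) / n_trials
```

Repeating the draw many times for a fixed θ shows the rule at work: the observed proportion correct converges on the probability given by Equation 1.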

Table 2 shows the distribution of ICC parameters for the 80 items for Test 1 (T1) and Test
2 (T2), while Table 3 shows the ICC parameters for the 20 anchor items which are common to
both tests.

Subjects from Group 1 were administered only the items in Test 1, and subjects from Group
2 only the items in Test 2. In order to study the effects of sample size, the ICC parameters were
estimated on four samples drawn with replacement as follows: 250; 500; 1,000; and 2,000. The
ICC parameters were estimated on these four sample sizes for both groups. Anchor ICC parameter
values from the four samples administered Test 1 serve as the input values for the anchor item
parameters to the second test. This permitted the four sizes of calibration sample (250; 500;
1,000; 2,000) to be varied and tried out with the four samples used to estimate the anchor item
ICC parameters.
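The crossing of sample sizes described above yields 16 conditions. A minimal sketch of the design follows; the subject identifiers, the pool size, and the seed are hypothetical, and the code only draws the samples and enumerates the conditions without estimating any parameters.

```python
import itertools
import random

SAMPLE_SIZES = (250, 500, 1000, 2000)

def draw_with_replacement(subjects, n, rng):
    """Draw n subjects with replacement from a pool of simulated subjects."""
    return rng.choices(subjects, k=n)

rng = random.Random(0)       # arbitrary seed
group_1 = list(range(2000))  # hypothetical subject identifiers

# One sample per size, as used to estimate the anchor item parameters.
samples = {n: draw_with_replacement(group_1, n, rng) for n in SAMPLE_SIZES}

# Every pairing of equating (anchor) sample size with calibration
# sample size: 4 x 4 = 16 combinations.
design = list(itertools.product(SAMPLE_SIZES, SAMPLE_SIZES))
```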


Table 2. Means and Standard Deviations
of the Generated Item Parameters for Test 1 (T1)
and Test 2 (T2)

                     Test
Parameter         T1        T2

Mean a          1.0564    1.0452
SD a            0.2793    0.2394
Mean b          0.0847    0.0559
SD b            0.8442    0.8577
Mean c          0.1878    0.2017
SD c            0.0542    0.0474

Table 3. ICC Parameters of the 20
Anchor Items Common to Both Tests

                ICC Parameter
Number       a          b          c

   1       .8000    -1.5000      .1000
   2       .8000    -1.3500      .1000
   3      1.0000    -1.2000      .1500
   4      1.0000    -1.0500      .1500
   5      1.1000     -.9000      .2000
   6      1.2000     -.7500      .2000
   7      1.2000     -.6000      .2200
   8      1.2000     -.4500      .2000
   9      1.3000     -.3000      .2000
  10      1.4000     -.1500      .2000
  11      1.4000      .1500      .2200
  12      1.3000      .3000      .2500
  13      1.2000      .4500      .2000
  14      1.2000      .6000      .2200
  15      1.1000      .7500      .2200
  16      1.0000      .9000      .2000
  17      1.0000     1.0500      .2500
  18       .8000     1.2500      .2500
  19       .8000     1.3500      .2500
  20       .8000     1.5000      .2500

Mean      1.0600      .0000      .2015
SD         .2113      .9549      .0453


III.  RESULTS 

Table 4 shows the intercorrelations between the known item parameters and the estimated
parameters. As past research indicates (Urry, 1976), the correlations generally increase with increasing
sample size. The correlations in Test 1 for b and estimates of b start high at .952 and increase
to an exceptionally high .992. Correlations for a and estimates of a begin moderately at .666 and
climb to .869, but the correlations of c and estimated c increase from only .031 to .115. In Test
2, much the same pattern is observed except that the correlation of c and estimated c increases
from .164 to .315 as sample size increases.

Because correlations are insensitive to constant differences, as might be found if ICC
parameters are overestimated or underestimated by a constant amount, summed absolute deviations
of the estimated parameters from the known parameters were computed for each parameter in
each sample size. Table 5 presents the summed absolute deviations (or summed errors) for both
tests with the four sample sizes. Figure 2 displays this graphically. There is a large drop in
summed error when the a parameter is estimated on progressively larger samples of subjects up to
and including the difference between 500 and 1,000 subjects. Between 1,000 and 2,000 subjects,
the difference in summed error is smaller. The relationship between error and sample size for the
b parameter is more nearly constant. That is, the line on the figure for estimates of b is generally
straight, which means error tends to be reduced in direct proportion to the number of subjects.
The almost flat line for the c parameter indicates that virtually no reduction of error is occurring

Table 4. Intercorrelations between Known
and Estimated ICC Parameters for Both
Groups with Varying Sample Sizes

Parameter      N        Test 1     Test 2

a             250        .666       .512
              500        .671       .725
            1,000        .831       .813
            2,000        .869       .886

b             250        .952       .929
              500        .964       .962
            1,000        .980       .979
            2,000        .992       .987

c             250        .031       .164
              500        .035       .109
            1,000       -.012       .331
            2,000        .115       .315


Table 5. Summed Absolute Deviations (Σ|Error|) and Average Absolute
Deviations (mean |Error|) for the Three ICC Parameters
for the Two Tests



30.6450 

22.8090 

15.7490 

15.5980 

23.5050 
19.8600 
17.6890 
12.7350


30.5290 

20.6910 

16.8910 

15.1390 

20.8470 

16.6070 

Figure 2. Error in Estimation of ICC Parameters.

with increasing sample size for that parameter. The average absolute deviation for the c parameter
is almost one-third of the entire range of the parameter, as the c parameter is generally estimated
between .00 and .30. However, past research (Ree, 1979) indicates that, even for low ability
subjects, the effects of errors in the estimation of the c parameter are small.
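The reason for reporting summed absolute deviations alongside correlations can be shown with a small example. The values below are hypothetical and the helper names are assumptions; the point is only that a constant overestimate leaves the correlation at 1.0 while the summed and average absolute deviations register the error.

```python
from statistics import mean

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

def summed_absolute_deviation(estimated, known):
    """Sum of |estimate - known value| over items."""
    return sum(abs(e - k) for e, k in zip(estimated, known))

known_b = [-1.5, -0.75, 0.0, 0.75, 1.5]
biased_b = [b + 0.2 for b in known_b]  # hypothetical constant overestimate

r = pearson_r(known_b, biased_b)                        # essentially 1.0
total_error = summed_absolute_deviation(biased_b, known_b)
average_error = total_error / len(known_b)              # per-item error
```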

Summed deviations of known ICC parameters from the equated value of the ICC parameters
were computed for the a and b parameters for the 16 combinations of calibration sample size and
equating sample size. Table 6 shows the summed deviations and the per item deviation for both
parameters for the 16 combinations. The equated a parameter shows large summed deviations
whenever the sample has been limited to 250 subjects, whether in the calibration or equating
sample. The lowest error rates for the a parameter occur when the anchor item values have been
estimated on 2,000 subjects. The effects of the size of the calibration sample are not so clear-cut.
When 2,000 subjects are used to estimate the anchor item ICC parameters, the magnitude of the
error is approximately the same for all calibration sample sizes except 250. With increasing
calibration sample size, the error rate increases by some small amount as indicated by the average
(per item) error. This is an unexpected result, and an explanation may be found in the
relationship between the sets of estimated a parameters. If the estimated a parameters were all
estimates of the same value and if the test scale were unidimensional, a basic assumption of the
theory, then the estimated a parameters should be linear transformations of one another and
should be correlated 1.0, as correlations are invariant to a linear transformation. This was not
found to be the case, and Table 7 shows the intercorrelation of the estimated a parameters. Only
the correlation between the estimate of a calculated on 1,000 subjects and the estimate of a
calculated on 2,000 subjects approaches this relationship. This lack of linearity may be due to the
assumption of normality and to the rescaling used in the calibration [...], which may
interact in such a way as to produce the anomalous results. Table 8 shows the intercorrelation of the
estimated b parameters. All exceed .900, and the summed deviations also show [...]
as sample size increases for the b parameter, indicating a virtually [...]


Table 6. Summed Absolute Deviations (Σ|Error|) and Average Absolute Deviations
(mean |Error|) for the a and b Parameters for Various
Equating and Calibrating Sample Sizes

[Table body not recoverable; column headings: Parameter, Number of Subjects]

Table 7. Intercorrelations, Means,
and Standard Deviation of the Estimated
a Parameters* for Test 2

*Variables are for the four sample sizes: 250; 500; 1,000;
2,000.


Table 8. Intercorrelations, Means, and
Standard Deviation of the Estimated
b Parameters* for Test 2

*Variables are for the four sample sizes: 250; 500; 1,000;
2,000.


estimated b parameters from sample to sample. However, with 500 subjects in the equating
sample, a similar anomaly is observed which may also be due to normal assumptions and to
rescaling.

IV.  DISCUSSION 

The  results  of  the  study  present  new  evidence  of  the  critical  interrelationship  between  item 
calibration  and  equating  sample  sizes  and  the  values  of  ICC  parameters. 

Estimating  and  Equating  a 

For  the  16  combinations  of  calibration  sample  sizes  and  equating  sample  sizes  identified  in 
Table  6,  the  least  deviation  of  estimated  a  from  its  known  value  occurred  with  an  equating 
sample  size  of  2,000  and  a  calibration  sample  size  of  500.  As  mentioned  in  the  previous  section, 
although  the  least  error  between  the  estimated  and  known  a  values  was  expected  with  a  match  of 
2,000  equating  and  2,000  calibrating  sample  sizes,  the  error  actually  increased  very  slightly  with 
increasing  calibration  sample  sizes  beyond  500.  This  discrepancy  apparently  results  from  a 
non-linear  transformation  with  sample  sizes  of  250  and  500  but  tends  toward  linearity  with  sample 
sizes  of  1,000  and  2,000. 

During equating procedures, a sample size > 500 should be developed to ensure an
acceptable degree of confidence that the estimation of a does not significantly depart from its
"true" value. In the same light, estimation of a suffers considerably using equating sample sizes of
less than 500, so equating samples of 1,000 or 2,000 are highly desirable to minimize error
in estimating a.

Estimating  and  Equating  b 

Table 6 also shows the linear relationship between error and sample size for the b
parameter. The b parameter is best estimated with calibration and equating samples of 2,000 each,
although a calibration sample size of 1,000 with an equating sample size of 500 can be tolerated
without an appreciable increase in error. With all combinations of calibration and equating sample
sizes, b is estimated quite well.

Estimating and Equating c

The flat line drawn in Figure 2, representing the data from Table 5, shows the estimation of
the c parameter to be nearly insensitive to increases in sample size. As sample size increases from
250 to 2,000 subjects, the error decreases but only very slightly. With the c defined as the lower
asymptote of the ICC and representing the probability of extremely low ability examinees
correctly answering an item, the inability to estimate c with precision could be disturbing.
However, it has been pointed out (Lord, 1975) that if a(θ - b) < -2, then the probability of a
correct response is approximately c. Therefore, if there are a large number of subjects with ability θ so that θ
< -(2/a - b), c can be accurately estimated. If this requirement is not met, c will be poorly estimated.
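The condition a(θ - b) < -2 can be turned into a rough count of how many subjects actually inform the estimate of c. The sketch below rests on assumptions: a hypothetical item with a = 1.0, b = 0.0, c = .20, a scaling constant of 1.7 in the logistic function, and an untruncated unit-normal ability distribution, which is close to but not identical with the groups in Table 1.

```python
import math
from statistics import NormalDist

def icc_probability(theta, a, b, c, D=1.7):
    """Three-parameter logistic ICC (D=1.7 is an assumed scaling constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def c_threshold(a, b):
    """Ability below which a * (theta - b) < -2, i.e. theta < b - 2 / a."""
    return b - 2.0 / a

a, b, c = 1.0, 0.0, 0.20         # hypothetical item
theta_cut = c_threshold(a, b)    # -2.0 for this item

# At the cut point the probability of success is already close to c.
p_at_cut = icc_probability(theta_cut, a, b, c)

# Share of a unit-normal population falling below the cut point, and
# the corresponding expected count in a sample of 2,000 subjects.
share_below = NormalDist(0.0, 1.0).cdf(theta_cut)
expected_in_2000 = 2000 * share_below
```

Under these assumptions only a few percent of subjects fall in the region where responses reflect c alone, which is consistent with the poor recovery of c even at the largest sample size.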

A  stable  and  accurate  estimate  of  the  a  and  b  parameters  requires  large  numbers  of  subjects 
over  a  broad  range  of  ability.  The  estimation  of  c  requires  large  numbers  of  subjects  at  very  low 
ability  levels.  This  holds  for  both  equating  and  calibrating  samples;  therefore,  it  is  necessary  to 
administer  test  items,  whether  to  be  calibrated  or  equated,  to  the  largest  samples  available. 